Databases and SQL

Effective Data Science: Acquiring and Sharing Data

Dr Zak Varty

What is a Database - concept

  • Collection of data frames (tables)
  • Named columns of same type (fields)
  • Analogy: sheets in a spreadsheet
  • Schema shows how these are related

CSV vs table vs data.frame

  • Stored on disk
  • Indexing
  • Optimised for row operations (records)
  • Designed for relational data

What is SQL?

Structured Query Language

  • A way to ask questions of a database
  • SQL is the ISO standard for doing so.
  • Many dialects / flavours depending on DBMS
    • SELECT, UPDATE, DELETE, INSERT, WHERE
    • DBMS specific extensions

Case Conventions

SQL is case instenstive. Conventions:

  • ALLCAPS for SQL verbs
  • Title Case for table names
  • lower case for field names

What is a Database - reality

Data base management systems

  • Client-server DBMS:
    • Central server, usually within organisation.
    • PostgeSQL, MariaDB, SQL Server, Oracle.
  • Cloud DBMS
    • Remote Server, storage as a service.
    • Snowflake, RedShift, BigQuery
  • In-process DBMS
    • Local machine
    • SQLite, duckdb


Q: When would you prefer each of these?

Installing SQLite

Mac OS

# Check installed
sqlite3 --version

Linux

#| eval: false
# Check installed
which sqlite3
# install if needed
sudo apt install sqlite3


Windows - use git for Windows to get unix-like command line.

Walkthrough

Antarctic Explorers

  • Based on Software Carpentry tutorial, dowload survey.db from blackboard.
# Move to where the database is stored locally
cd Work/teaching/00-effective-data-science/effective-data-science-resources/demo-classes/02-sql 

# Open interactive SQLite session
sqlite3 survey.db

# Close interactive session
.quit
.exit

#Get help 
.help

Learning about a database

-- List all tables
.tables

-- Column summaries
.schema


-- Get an entire table - BEWARE!
SELECT * FROM Person;

-- Get a single field 
SELECT id FROM Person;

Pretty Output

Our command line output could look better:

# left align output by column
.mode column

# show field headers
.header on


Try again

SELECT * FROM Person

Selecting Fields

Selecting a single field from a table

-- Get a single field 
SELECT id FROM Person;


Selecting multiple fields from a table

-- Get multiple fields 
SELECT id, id FROM Person;               -- Beware!

-- Records can change between repreated queries
SELECT id, Personal, Family FROM Person;
SELECT id, Personal, Family FROM Person; -- Beware!

Working with factors

Distinct levels

SELECT DISTINCT quant FROM Survey;


Distinct pairs or touples

SELECT DISTINCT quant, person FROM Survey;

Ordering

Records are not in a fixed order and may change between queries

# ascending order by default
SELECT * FROM Person ORDER BY id;
SELECT * FROM Person ORDER BY id ASC; -- for emphasis

# switch to descending
SELECT * FROM Person ORDER BY id DESC;


Q: What do you expect to see?

SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC;

Filtering

Filter with WHERE, combine logical conditons (=, >, <, …) with AND and OR.

SELECT person, reading FROM Survey WHERE Person = 'lake';
SELECT person, reading FROM Survey WHERE Person = 'lake' OR Person = 'roe';
SELECT person, reading FROM Survey WHERE Person IN ('lake', 'roe');


-- Be explicit about order of operations
SELECT person, reading FROM Survey 
    WHERE (person = 'lake' OR person = 'roe') AND reading > 10;

SELECT person, reaing FROM Survey 
    WHERE person = 'lake' OR (person = 'roe' AND reading > 10);

Q: Which matches the case with no parentheses?

Wildcards

LIKE and the % wildcard.

Combined, these behave like * in regex.


SELECT * FROM Visited WHERE site LIKE 'DR%';


Q: Is the following true?

 'data science' LIKE '%a%'

Calculating

-- Horrible field title
SELECT 1.05 * reading FROM Survey WHERE quant = 'rad';

-- Nicer field title
SELECT 1.05 * reading as radiation FROM Survey WHERE quant = 'rad';

Joining Tables (Basics)

Join makes a cross product with 3 * 8 = 24 entries

SELECT * FROM Site JOIN Visited;


Tell SQL how to match the tables by specifying a primary key and secondary key


SELECT
  *
FROM
  Site
  JOIN Visited ON Site.name = Visited.site;

Join and Select

Notice the Table.field synax, this aviods issues with duplicate column names.

SELECT
  Site.lat,
  Site.long,
  Visited.dated
FROM
  Site
  JOIN Visited ON Site.name = Visited.site;

Joins can become more complicated

SELECT
  Site.lat,
  Site.long,
  Visited.dated,
  Survey.quant,
  Survey.reading
FROM 
  Site
  JOIN Visited
  JOIN Survey ON Site.name = Visited.site
  AND Visited.id = Survey.taken
  AND Visited.dated IS NOT NULL;

Learning More

  • Other types of join: inner, left, right
  • Lots of practise with {dplyr}
  • Write your queries in R using {dbplyr}
  • Short tutorial
  • SQL for Data Analysis book

This Week

Acquiring and Sharing Data Checklist